Abstract

Let $\{(X_i, Y_i)\}$ be a stationary ergodic time series with $(X, Y)$ values in the product
space $\mathbb{R}^d \times \mathbb{R}$. This study offers what is believed to be the first strongly consistent (with
respect to pointwise, least-squares, and uniform distance) algorithm for inferring $m(x) =
E[Y_0 \mid X_0 = x]$ under the presumption that $m(x)$ is uniformly Lipschitz continuous.
Auto-regression, or forecasting, is an important special case, and as such our work extends the
literature of nonparametric, nonlinear forecasting by circumventing customary mixing 
assumptions. The work is motivated by a time series model in stochastic finance and by 
perspectives of its contribution to the issues of universal time series estimation. 
1 Introduction



Nonparametric regression has been applied to a variety of contexts, in particular to time 
series modeling and prediction. The present study contributes to the methodology by 
showing how a regression function can be consistently inferred from time series data 
under no process assumptions beyond stationarity and ergodicity. (A Lipschitz condition 
on the regression function itself will be imposed.) 

Toward showing how our methodology can impinge on an established research area, 
we give one substantive application to a practical problem in stochastic finance: Many 
works, such as the Chapter entitled "Some Recent Developments in Investment Research" 
of the prominent text [5], argue for the need to move beyond the Black-Scholes stochastic 
differential equation. This and other studies suggest the so-called ARCH and GARCH 
extensions as a promising direction. The review of this approach by Bollerslev et al. [6] 
cites a litany of unresolved issues. Of particular relevance is the discussion of the need 
to account for persistency of the variance (Sections 2.6 and 3.6). (ARCH and GARCH 
models can be long-range dependent for certain ranges of parameters. In these cases, 
statistical analysis is delicate [8].) 

The basic idea behind the ARCH/GARCH setup is that one must allow the asset 
volatility (variance) to change dynamically, and perhaps (GARCH) to depend on current 
and past volatility values. The review [6] documents (p. 30) that several authors have 
applied nonparametric and semiparametric regression, with some success, to infer the 
ARCH functions from data. These methods can fail if fairly stringent mixing conditions 
are not in force. Masry and Tjøstheim [21], because of their rigorous consideration of
consistency, set the stage for appreciating the potential of the present investigation.
They propose that both the asset dynamics and volatility of a nonlinear ARCH series
be inferred from nonparametric classes of regression functions. By imposing some fairly
severe assumptions, which would be tricky to validate from data, these authors are able
to assure that the ARCH process is strongly mixing (with exponentially decreasing
parameter) and consequently standard kernel techniques are applicable.

On another avenue toward asset series modelling, decades ago, Mandelbrot suggested 
that fractal processes should be considered in this context. Fractals have been of interest 
to theorists and modellers alike in part because they can display persistency. In his 1999 
study, "A Multifractal Walk down Wall Street" [20], Mandelbrot argues that conventional
models for portfolio theory ignore soaring volatility, and that is akin to a mariner ignoring 
the possibility of a typhoon on the basis of the observation that weather is moderate 95% 
of the time. 

Such persistence as exhibited in the models of finance calls into question whether 
various processes of interest are actually strongly mixing, a consistency requirement 
for conventional nonparametric regression techniques. We mention parenthetically that
telecommunications modelers are increasingly turning toward long-range-dependent
processes (e.g., [28] and [37]).

As mentioned, the primary contribution of the present paper is an algorithm which is
demonstrably consistent without imposition of mixing assumptions. The implication is
that process assumptions such as in [21] are not required for our algorithm. The price paid
for this flexibility is that convergence rates and asymptotic normality cannot be assured.
This avenue is worthy of exploration, nevertheless, because the limits of process inference
are clarified, and as a practical matter, future work might lead to methods which are
reasonably efficient if the process does satisfy mixing assumptions, but simultaneously
assure convergence when mixing fails.

The algorithm is of the series-expansion type. The foundational idea (after Kieffer
[17]) is that sometimes it is possible to bound the error of ignoring the series tail, and
additionally assure that the leading coefficients are consistently estimated. Specific
constructs are given for a partition-type estimator (Section 2) and for a kernel series (Section
3).

We close this introduction with a survey of the literature of nonparametric estimation 
for stationary series without mixing hypotheses. 

Let $Y$ be a real-valued random variable and let $X$ be a $d$-dimensional random vector
(i.e., the observation or covariate). We do not assume anything about the distribution
of $X$. As is customary in regression and forecasting, the main aim of the analysis here is
to minimize the mean-squared error:

$$\min_f E\{(f(X) - Y)^2\}$$

over some space of real-valued functions $f(\cdot)$ defined on the range of $X$. This minimum
is achieved by the regression function $m(x)$, which is defined to be the conditional
expectation of $Y$ given $X$:

$$m(x) = E(Y \mid X = x), \tag{1}$$

assuming the expectation is well-defined, i.e., if $E|Y| < \infty$. For each measurable function
$f$ one has

$$\begin{aligned}
E\{(f(X) - Y)^2\} &= E\{(m(X) - Y)^2\} + E\{(m(X) - f(X))^2\} \\
&= E\{(m(X) - Y)^2\} + \int (m(x) - f(x))^2 \mu(dx),
\end{aligned}$$

where $\mu$ stands for the distribution of the observation $X$. The second term on the right
hand side is called excess error or integrated squared error for the function $f$, which is
given the notation

$$J(f) = \int (m(x) - f(x))^2 \mu(dx). \tag{2}$$

Clearly, the mean squared error for $f$ is close to that of the optimal regression function
only if the excess error $J(f)$ is close to 0.
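The decomposition of the mean squared error into a noise term and the excess error $J(f)$ can be checked numerically. The following is a minimal sketch under an assumed toy model that is not taken from the paper: $X$ uniform on $(0,1)$, $Y = m(X)$ plus standard normal noise, with the hypothetical choices $m(x) = 2x$ and $f(x) = 1$. All function names are illustrative.

```python
import random


def mse_decomposition(f, m, n=200_000, seed=0):
    """Monte Carlo estimates of E(f(X)-Y)^2, E(m(X)-Y)^2 and J(f).

    Assumed toy model (not from the paper): X ~ Uniform(0, 1) and
    Y = m(X) + N(0, 1) noise, so that E[Y | X = x] = m(x).
    """
    rng = random.Random(seed)
    total = noise = excess = 0.0
    for _ in range(n):
        x = rng.random()
        y = m(x) + rng.gauss(0.0, 1.0)
        total += (f(x) - y) ** 2    # mean squared error of f
        noise += (m(x) - y) ** 2    # error of the regression function itself
        excess += (m(x) - f(x)) ** 2  # integrand of J(f)
    return total / n, noise / n, excess / n


def m_true(x):
    """Hypothetical regression function."""
    return 2.0 * x


def f_candidate(x):
    """Hypothetical competitor, deliberately different from m_true."""
    return 1.0


total, noise, excess = mse_decomposition(f_candidate, m_true)
print(f"MSE(f) = {total:.3f}, MSE(m) + J(f) = {noise + excess:.3f}, J(f) = {excess:.3f}")
```

Under these assumptions the excess error is $\int_0^1 (2x - 1)^2 dx = 1/3$, and the two sides of the decomposition should agree up to Monte Carlo error because the cross term has mean zero.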

With respect to the statistical problem of regression estimation, let $(X_1, Y_1), \ldots, (X_n, Y_n), \ldots$
be a stationary ergodic time series with marginal component denoted as $(X, Y)$. We study
pointwise, $L_2(\mu)$, and uniform convergence of the regression estimate $m_n$ to $m$. The estimator
$m_n$ is called weakly universally consistent if $J(m_n) \to 0$ in probability for all distributions
of $(X, Y)$ with $E|Y|^2 < \infty$. In the context of independent identically-distributed (i.i.d.)
pairs $(X, Y)$, Stone [35] first pointed out in 1977 that there exist weakly universally
consistent estimators. Similarly, $m_n$ is called strongly universally consistent if $J(m_n) \to 0$
a.s. for all distributions of $(X, Y)$ with $E|Y|^2 < \infty$.

Following pioneering papers by Roussas [31] and Rosenblatt [30], a large body of 
literature has accumulated on consistency and asymptotic normality when the samples 
are correlated. In developments below, we will employ the notation

$$X_m^n = (X_m, \ldots, X_n),$$

presuming that $m \le n$.

The theory of nonparametric regression is of significance in time series analysis
because, by considering samples $\{(X_{n-q+1}^{n}, X_{n+1})\}$ in place of the pairs $\{(X_n, Y_n)\}$, the
regression problem is transformed into the forecasting (or auto-regression) problem. Thus,
in forecasting, we are asking for the conditional expectation of the next observation, given
the $q$-past, with $q$ a positive integer, or perhaps infinity.

As mentioned, nearly all the works on consistent statistical methods for forecasting 
hypothesize mixing conditions, which are assumptions about how quickly dependency 
attenuates as a function of time separation of the observables. Under a variety of mixing 
assumptions, kernel and partitioning estimators are consistent, and have attractive rate 
properties. The monograph by Györfi et al. [14] gives a coverage of the literature of
nonparametric inference for dependent series. In that work, the partition estimator is
shown to be strongly consistent, provided $|Y|$ is a.s. bounded, under $\phi$-mixing and, with
some provisos, under $\alpha$-mixing. A drawback to much of the literature on nonparametric
forecasting is that mixing conditions are unverifiable by available statistical procedures.
Consequently, some investigators have examined the problem:

Let $\{X_i\}$ be a real vector-valued stationary ergodic sequence. Find a forecasting
algorithm which is provably consistent in some sense.

Of course, some additional hypotheses regarding smoothness of the auto-regression
function and moment properties of the variables will be allowed, but additional
assumptions about attenuation of dependency are ruled out. A forecasting algorithm for

$$m(X_{-\infty}^{-1}) = E[X_0 \mid X_{-\infty}^{-1}]$$

here means a rule giving a sequence $\{m_n\}$ of numbers such that for each $n$, $m_n$ is a
measurable function determined entirely by the data segment $X_{-n}^{-1}$.

For $X$ binary, Ornstein [27] provided a (complicated) strongly-consistent estimator
of $E[X_0 \mid X_{-\infty}^{-1}]$. Algoet [1] extended this approach to achieve convergence over real-valued
time series and in this and [2], connected the universal forecasting problem with
fundamental issues in portfolio and gambling analysis as well as data compression. Morvai et
al. [22] offered another algorithm achieving strong consistency in the above sense. Their
algorithm is easy to describe and analyze, and such analysis shows, unfortunately, that
its data requirements make it infeasible [23].

On the negative side, Bailey [4] and Ryabko [32] have proven that even over binary
processes, there is no strongly consistent estimator for the dynamic problem of inferring
$E[X_{n+1} \mid X_0^n]$, $n = 0, 1, 2, \ldots$.

We mention that for a real vector-valued Markov series with a stationary transition 
law, a strongly-consistent estimator is available for inferring $m(x) = E[X_0 \mid X_{-1} = x]$
under the hypothesis that the sequence is Harris recurrent [38]. Admittedly this is a 
dependency condition, but the marginal (i.e., invariant) law need not exist: Positive 
recurrence is not hypothesized. It is difficult to imagine a Markov condition weaker than 
Harris recurrence under which statistical inference is assured. 

It is to be noted that there are weakly-consistent estimators for the moving regression
problem $E[X_{n+1} \mid X_0^n]$, $n = 0, 1, 2, \ldots$. It turns out that universal coding algorithms (e.g.
[39]) of the information theory literature can be converted to weakly-universally consistent
algorithms when the coordinate space is finite. Morvai et al. [25] have given a
weakly-consistent (and potentially computationally feasible) regression estimator for the moving
regression problem when $X$ takes values from the set of real numbers. That work offers a
synopsis of the literature of weakly consistent estimation for stationary and ergodic time 
series. All the studies we have cited on consistency without mixing assumptions rely 
on algorithms which do not fall into any of the traditional classes (partitioning, kernel, 
nearest neighbor) mentioned in connection with i.i.d. regression. 

From this point on, $\{(X_i, Y_i)\}$ will represent a time series with $(X, Y)$ values in $\mathbb{R}^d \times \mathbb{R}$
which is stationary and ergodic, and such that $E|Y_i| < \infty$. In Section 2, we establish by
means of a variation on the partitioning method, that we have a.s. convergence pointwise,
and, in the case of bounded support, in uniform distance, provided that the regression
function $m(x) = E[Y \mid X = x]$ satisfies a Lipschitz condition and a bound on the Lipschitz
constant is known in advance. If furthermore $|Y|$ is known to be bounded (but perhaps
the bound itself is not known), then our algorithm converges in $L_2(\mu)$. Section 3 provides
analogous results for a truncated kernel-type estimate. In summary, we miss our goal of
pointwise strong universal consistency only in that we must restrict attention to regression
functions satisfying a uniform Lipschitz condition and the user must have a bound to the
Lipschitz constant. From counter-examples in Györfi et al. [16] one sees that some
restrictions are needed.

Recently we have obtained an important preprint by Nobel et al. [26] which bears
similarities with the present investigation. That study gives an algorithm for the
long-standing problem of density estimation of the marginal of a stationary sequence.
Somewhat analogous to our conditions, Nobel et al. require that the density function be of
bounded variation. The algorithm itself is based on different principles from the present 
paper. In the paper [24] by G. Morvai, S. Kulkarni, and A. Nobel, the ideas in [26] were 
extended for regression estimation. 

2 Truncated partitioning estimation 

Let $\{(X_i, Y_i)\}$ be an ergodic stationary random sequence with $E|Y| < \infty$. Now we
attack the problem of estimating the regression function m(x) by combining partitioning 
estimation with a series expansion. 

Let $\mathcal{P}_k = \{A_{k,i},\ i = 1, 2, \ldots\}$ be a nested cubic partition of $\mathbb{R}^d$ with volume $(2^{-k-2})^d$.
Define $A_k(x)$ to be the partition cell of $\mathcal{P}_k$ into which $x$ falls. Take

$$M_k(x) := E(Y \mid X \in A_k(x)) = \frac{1}{\mu(A_k(x))} \int_{A_k(x)} m(z)\,\mu(dz). \tag{3}$$

One can show that

$$M_k(x) \to m(x) \tag{4}$$

for $\mu$-almost all $x \in \mathbb{R}^d$. (To see this, notice that $\{M_k(X), \sigma(A_k(X));\ k = 1, 2, \ldots\}$ is
a martingale, $E|Y| < \infty$ implies $\sup_{k=1,2,\ldots} E|M_k(X)| < \infty$ and hence the martingale
convergence theorem can be applied to achieve the desired result (4), cf. Ash [3] p.
292.)

For $k \ge 2$ let

$$\Delta_k(x) = M_k(x) - M_{k-1}(x). \tag{5}$$

Our analysis is motivated by the representation,

$$m(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_k(x) = \lim_{k \to \infty} M_k(x) \tag{6}$$

for $\mu$-almost all $x \in \mathbb{R}^d$. Now let $L > 0$ be an arbitrary positive number. For integer
$k \ge 2$ define

$$\Delta_{k,L}(x) = \mathrm{sign}(M_k(x) - M_{k-1}(x)) \min(|M_k(x) - M_{k-1}(x)|, L 2^{-k}). \tag{7}$$

Define

$$m_L(x) := M_1(x) + \sum_{k=2}^{\infty} \Delta_{k,L}(x). \tag{8}$$

Notice that $|\Delta_{k,L}(x)| \le L 2^{-k}$, and hence $m_L(x)$ is well defined for all $x \in S$, where $S$
stands for the support of $\mu$ defined as

$$S := \{x \in \mathbb{R}^d : \mu(A_k(x)) > 0 \text{ for all } k \ge 1\}. \tag{9}$$

By Cover and Hart [7], $\mu(S) = 1$.
The crux of the truncated partitioning estimate is inference of the terms $M_1(x)$ and
$\Delta_{k,L}(x)$ for $k = 2, 3, \ldots$ in (8). Define

$$M_{k,n}(x) := \frac{\sum_{j=1}^{n} Y_j 1_{\{X_j \in A_k(x)\}}}{\sum_{j=1}^{n} 1_{\{X_j \in A_k(x)\}}}. \tag{10}$$

If $\sum_{j=1}^{n} 1_{\{X_j \in A_k(x)\}} = 0$, then take $M_{k,n}(x) = 0$. Now for $k \ge 2$, define

$$\Delta_{k,n,L}(x) = \mathrm{sign}(M_{k,n}(x) - M_{k-1,n}(x)) \min(|M_{k,n}(x) - M_{k-1,n}(x)|, L 2^{-k}) \tag{11}$$

and for $N_n$ a non-decreasing unbounded sequence of positive integers, define the estimator

$$\hat m_{n,L}(x) = M_{1,n}(x) + \sum_{k=2}^{N_n} \Delta_{k,n,L}(x). \tag{12}$$
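The estimator in (10)-(12) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the cells are taken to be the dyadic cubes $[i\,2^{-k-2}, (i+1)\,2^{-k-2})^d$ (one possible nested cubic partition), and the demonstration data are i.i.d. (a special case of a stationary ergodic sequence) with the assumed regression function $m(x) = |x - 0.5|$.

```python
import math
import random


def cell_mean(xs, ys, x, k):
    """M_{k,n}(x): mean of Y_j over X_j in the cell A_k(x); 0 if the cell is empty."""
    side = 2.0 ** (-k - 2)  # cell side length at level k
    cell = tuple(math.floor(c / side) for c in x)
    total, count = 0.0, 0
    for xj, yj in zip(xs, ys):
        if tuple(math.floor(c / side) for c in xj) == cell:
            total += yj
            count += 1
    return total / count if count else 0.0


def truncated_partition_estimate(xs, ys, x, L, N):
    """M_{1,n}(x) plus the clipped increments Delta_{k,n,L}(x), k = 2..N."""
    prev = cell_mean(xs, ys, x, 1)
    estimate = prev
    for k in range(2, N + 1):
        cur = cell_mean(xs, ys, x, k)
        bound = L * 2.0 ** (-k)
        # sign(d) * min(|d|, bound) is the same as clipping d to [-bound, bound]
        estimate += max(-bound, min(bound, cur - prev))
        prev = cur
    return estimate


# Demonstration on assumed data: X ~ Uniform(0, 1), Y = |X - 0.5| + bounded noise.
rng = random.Random(0)
xs = [(rng.random(),) for _ in range(20000)]
ys = [abs(x[0] - 0.5) + rng.uniform(-0.1, 0.1) for x in xs]
estimate = truncated_partition_estimate(xs, ys, (0.3,), L=1.0, N=5)
print(f"estimate at x = 0.3: {estimate:.3f} (m(0.3) = 0.2 in this toy model)")
```

Here `N` plays the role of $N_n$ and is fixed by hand; in the setting of the paper it would grow with the sample size. The clipping step is what distinguishes this from the standard partitioning estimate: whatever the data, the estimate can move away from $M_{1,n}(x)$ by less than $L/2$ in total.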

Theorem 1 Let $\{(X_i, Y_i)\}$ be a stationary ergodic time series with $E|Y_i| < \infty$. Assume
$N_n \to \infty$. Then almost surely, for all $x \in S$,

$$\hat m_{n,L}(x) \to m_L(x). \tag{13}$$

If the support $S$ of $\mu$ is a bounded subset of $\mathbb{R}^d$ then almost surely

$$\sup_{x \in S} |\hat m_{n,L}(x) - m_L(x)| \to 0. \tag{14}$$

If either (i) $|Y| \le D < \infty$ almost surely ($D$ need not be known) or (ii) $\mu$ is of bounded
support then

$$\int (\hat m_{n,L}(x) - m_L(x))^2 \mu(dx) \to 0. \tag{15}$$
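Proof First we prove that almost surely, for all $x \in S$, and for all $k \ge 1$,

$$\lim_{n \to \infty} |M_{k,n}(x) - M_k(x)| = 0. \tag{16}$$

By the ergodic theorem, as $n \to \infty$, a.s.,

$$\frac{1}{n} \sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}} \to P(X \in A_{k,i}) = \mu(A_{k,i}).$$

Similarly,

$$\frac{1}{n} \sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}} Y_j \to E(Y 1_{\{X \in A_{k,i}\}}) = \int_{A_{k,i}} m(z)\,\mu(dz),$$

which is finite since $E|Y|$ is finite. Since there are countably many $A_{k,i}$, almost surely,
for all $A_{k,i} \in \bigcup_{k} \mathcal{P}_k$ for which $\mu(A_{k,i}) > 0$:

$$\frac{\sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}} Y_j}{\sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}}} \to E(Y \mid X \in A_{k,i}).$$

Since for each $x \in S$, $\mu(A_k(x)) > 0$ and for some index $i$, $A_k(x) = A_{k,i}$, we have proved
(16).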

Particularly, almost surely, for all $x \in S$, and for all $k \ge 2$,

$$M_{1,n}(x) \to M_1(x) \tag{17}$$

and

$$\Delta_{k,n,L}(x) \to \Delta_{k,L}(x). \tag{18}$$

Let integer $R > 1$ be arbitrary. Let $n$ be so large that $N_n > R$. For all $x \in S$,

$$\begin{aligned}
|\hat m_{n,L}(x) - m_L(x)|
&\le |M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{N_n} |\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + \sum_{k=N_n+1}^{\infty} |\Delta_{k,L}(x)| \\
&\le |M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + \sum_{k=R+1}^{\infty} (|\Delta_{k,n,L}(x)| + |\Delta_{k,L}(x)|) \\
&\le |M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + 2L \sum_{k=R+1}^{\infty} 2^{-k} \\
&\le |M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + L 2^{-(R-1)}.
\end{aligned} \tag{19}$$

By (17) and (18), almost surely, for all $x \in S$,

$$|M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| \to 0. \tag{20}$$

By (19), almost surely, for all $x \in S$,

$$\limsup_{n \to \infty} |\hat m_{n,L}(x) - m_L(x)| \le L 2^{-(R-1)}. \tag{21}$$

Since $R$ was arbitrary, (13) is proved.

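Now we prove (14). Assume the support $S$ of $\mu$ is bounded. Let $\mathcal{A}_k$ denote the set
of hyper-cubes from partition $\mathcal{P}_k$ with nonempty intersection with $S$. That is, define

$$\mathcal{A}_k = \{A \in \mathcal{P}_k : A \cap S \ne \emptyset\}. \tag{22}$$

Since $S$ is bounded, $\mathcal{A}_k$ is a finite set. For $A \in \mathcal{P}_k$ let $a(A)$ be the center of $A$. Then
almost surely,

$$\begin{aligned}
&\sup_{x \in S} \Big( |M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| \Big) \\
&\quad \le \max_{A \in \mathcal{A}_1} |M_{1,n}(a(A)) - M_1(a(A))| + \sum_{k=2}^{R} \max_{A \in \mathcal{A}_k} |\Delta_{k,n,L}(a(A)) - \Delta_{k,L}(a(A))| \qquad (23) \\
&\quad \to 0, \qquad (24)
\end{aligned}$$

keeping in mind that only finitely many terms are involved in the maximization operation.
The rest of the proof goes virtually as before.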
Now we prove (15). We have

$$|\hat m_{n,L}(x) - m_L(x)|^2 \le 2 \Big( |M_{1,n}(x) - M_1(x)|^2 + \Big| M_1(x) + \sum_{k=2}^{N_n} \Delta_{k,n,L}(x) - m_L(x) \Big|^2 \Big).$$

If condition (i) holds, then for the first term we have dominated convergence

$$|M_{1,n}(x) - M_1(x)|^2 \le (2D)^2,$$

and for the second one, too:

$$\Big| M_1(x) + \sum_{k=2}^{N_n} \Delta_{k,n,L}(x) - m_L(x) \Big| \le \sum_{k=2}^{\infty} (|\Delta_{k,n,L}(x)| + |\Delta_{k,L}(x)|) \le L,$$

and thus (15) follows by Lebesgue's dominated convergence theorem,

$$0 = \int \lim_{n \to \infty} |\hat m_{n,L}(x) - m_L(x)|^2 \mu(dx) = \lim_{n \to \infty} \int |\hat m_{n,L}(x) - m_L(x)|^2 \mu(dx)$$

almost surely. If condition (ii) holds then (15) follows from (14).
□

Corollary 1 Assume $m(x)$ is Lipschitz continuous with Lipschitz constant $C$. With the
choice of $L \ge C\sqrt{d}$, for all $x \in S$, $m_L(x) = m(x)$ and Theorem 1 holds with $m_L(x)$
replaced by $m(x)$.

Proof Since $m(x)$ is Lipschitz with constant $L/\sqrt{d}$, for $x \in S$,

$$\begin{aligned}
|M_k(x) - m(x)| &\le \frac{1}{\mu(A_k(x))} \int_{A_k(x)} |m(y) - m(x)|\,\mu(dy) \\
&\le \frac{1}{\mu(A_k(x))} \int_{A_k(x)} (L/\sqrt{d})(2^{-k-2}\sqrt{d})\,\mu(dy) \\
&= L 2^{-k-2}
\end{aligned}$$

and $M_k(x) \to m(x)$. For $x \in S$ we get

$$\begin{aligned}
|M_k(x) - M_{k-1}(x)| &\le |M_k(x) - m(x)| + |m(x) - M_{k-1}(x)| \\
&\le L 2^{-k-2} + L 2^{-k-1} \\
&< L 2^{-k}.
\end{aligned}$$

Thus $m(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_k(x)$ and $\Delta_{k,L}(x) = \Delta_k(x)$ for all $x \in S$. Hence for all
$x \in S$,

$$m_L(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_{k,L}(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_k(x) = m(x)$$

and Corollary 1 is proved.
□
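The key step of the proof, that the truncation level $L 2^{-k}$ is never reached when $m$ is Lipschitz, can be illustrated numerically. The sketch below assumes a toy setting that is not from the paper: $d = 1$, $X$ uniform on $(0,1)$, $m(x) = \sin x$ (Lipschitz with constant $C = 1$), and $L = 1$. In that setting $M_k(x)$ has a closed form, so no sampling is involved; the function names are hypothetical.

```python
import math


def cell_average(x, k):
    """M_k(x) for m(x) = sin(x) and X ~ Uniform(0, 1): exact mean of sin over A_k(x)."""
    side = 2.0 ** (-k - 2)
    a = math.floor(x / side) * side  # left endpoint of the dyadic cell containing x
    return (math.cos(a) - math.cos(a + side)) / side


def max_increment_ratio(L=1.0, k_max=12, grid=200):
    """Largest |M_k(x) - M_{k-1}(x)| / (L 2^{-k}) over a grid of x and k = 2..k_max."""
    worst = 0.0
    for i in range(grid):
        x = (i + 0.5) / grid
        for k in range(2, k_max + 1):
            increment = abs(cell_average(x, k) - cell_average(x, k - 1))
            worst = max(worst, increment / (L * 2.0 ** (-k)))
    return worst


ratio = max_increment_ratio()
print(f"max |Delta_k| / (L 2^-k) = {ratio:.3f}")
```

The proof bounds each increment by $L 2^{-k-2} + L 2^{-k-1} = \tfrac{3}{4} L 2^{-k}$, so the printed ratio should stay below $3/4$; a ratio below 1 is what makes $\Delta_{k,L} = \Delta_k$.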

Remark 1. If there is no truncation, that is if $L = \infty$, then $\hat m_n = M_{N_n,n}$. In this case,
$\hat m_n$ is the standard partitioning estimate (defined, for example in [14]). It is known that
there is an ergodic process $\{(X_i, Y_i)\}$ with Lipschitz continuous $m(x)$ with constant $C = 1$
such that a classical partitioning estimate is not even weakly consistent (cf. Györfi,
Morvai, Yakowitz [16]).

Remark 2. Our consistency is not universal, however, since m is hypothesized to be 
Lipschitz continuous. 

Remark 3. $N_n$ can be data dependent, provided $N_n \to \infty$ a.s.

Remark 4. The methodology here is applicable to linear auto-regressive processes. Let
$\{Z_i\}$ be i.i.d. random variables with $EZ = 0$ and $\mathrm{Var}(Z) < \infty$. Define

$$W_{n+1} = a_1 W_n + a_2 W_{n-1} + \ldots + a_K W_{n-K+1} + Z_{n+1} \tag{25}$$

where $\sum_{i=1}^{K} |a_i| < 1$. Equation (25) yields a stationary ergodic solution. Assume $K \le d$.
Let $Y_{n+1} = W_{n+1}$, and $X_{n+1} = (W_n, \ldots, W_{n-d+1})$. Now

$$\begin{aligned}
m(X_{n+1}) = E(Y_{n+1} \mid X_{n+1}) &= E(W_{n+1} \mid W_n, \ldots, W_{n-d+1}) \\
&= a_1 W_n + a_2 W_{n-1} + \ldots + a_K W_{n-K+1}.
\end{aligned}$$

The regression function $m(x)$ is Lipschitz continuous with constant $C = 1$, since for
$x = (x_1, \ldots, x_d)$ and $z = (z_1, \ldots, z_d)$,

$$|m(x) - m(z)| \le \sum_{i=1}^{K} |a_i| |x_i - z_i| \le \max_{1 \le i \le d} |x_i - z_i| \le \|x - z\|.$$
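The construction in Remark 4 can be sketched as follows: simulate the recursion (25), form the pairs $(X_{n+1}, Y_{n+1})$, and check the Lipschitz inequality on sample points. The coefficients $(0.5, -0.3)$, the Gaussian innovations, the burn-in length used to approximate stationarity, and all function names are assumptions made for illustration.

```python
import random


def simulate_ar(coeffs, n, burn_in=500, seed=0):
    """Simulate W_{t+1} = a_1 W_t + ... + a_K W_{t-K+1} + Z_{t+1} with N(0, 1) innovations."""
    rng = random.Random(seed)
    w = [0.0] * len(coeffs)  # arbitrary start; the burn-in is discarded
    for _ in range(burn_in + n):
        nxt = sum(a * w[-1 - i] for i, a in enumerate(coeffs)) + rng.gauss(0.0, 1.0)
        w.append(nxt)
    return w[-n:]


def regression_pairs(w, d):
    """Pairs (X_{t+1}, Y_{t+1}) = ((W_t, ..., W_{t-d+1}), W_{t+1})."""
    return [(tuple(w[t - 1 - i] for i in range(d)), w[t]) for t in range(d, len(w))]


def m(x, coeffs):
    """Regression function of the AR model: a_1 x_1 + ... + a_K x_K."""
    return sum(a * xi for a, xi in zip(coeffs, x))


def lipschitz_gap(pairs, coeffs):
    """Largest |m(x) - m(z)| - max_i |x_i - z_i| over consecutive sample points."""
    worst = float("-inf")
    for (x, _), (z, _) in zip(pairs, pairs[1:]):
        gap = abs(m(x, coeffs) - m(z, coeffs)) - max(abs(a - b) for a, b in zip(x, z))
        worst = max(worst, gap)
    return worst


coeffs = [0.5, -0.3]  # sum of |a_i| is 0.8 < 1
w = simulate_ar(coeffs, 1000)
pairs = regression_pairs(w, d=2)
print(f"largest Lipschitz gap: {lipschitz_gap(pairs, coeffs):.4f} (non-positive when C = 1 suffices)")
```

The pairs produced this way are in the form expected by a regression estimator, with the caveat that a simulated path from a fixed start is only approximately stationary.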

3 Truncated kernel estimation 

Let $K(x)$ be a non-negative continuous kernel function with

$$b\,1_{\{x \in S_{0,r}\}} \le K(x) \le 1_{\{x \in S_{0,1}\}},$$

where $0 < b \le 1$ and $0 < r \le 1$. ($S_{z,r}$ denotes the closed ball around $z$ with radius $r$.)



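Choose

$$h_k = 2^{-k-2}$$

and

$$M_k^*(x) = \frac{E\big(Y K\big(\frac{X - x}{h_k}\big)\big)}{E\big(K\big(\frac{X - x}{h_k}\big)\big)} = \frac{\int m(z) K\big(\frac{z - x}{h_k}\big)\,\mu(dz)}{\int K\big(\frac{z - x}{h_k}\big)\,\mu(dz)}. \tag{26}$$

Let

$$\Delta_k^*(x) = M_k^*(x) - M_{k-1}^*(x). \tag{27}$$

As a motivation, we note that Devroye [9] yields (4), and therefore (6), too. Now for
$k \ge 2$, define

$$\Delta_{k,L}^*(x) = \mathrm{sign}(M_k^*(x) - M_{k-1}^*(x)) \min(|M_k^*(x) - M_{k-1}^*(x)|, L 2^{-k}). \tag{28}$$

Define

$$m_L^*(x) := M_1^*(x) + \sum_{k=2}^{\infty} \Delta_{k,L}^*(x). \tag{29}$$

Put

$$M_{k,n}^*(x) := \frac{\sum_{j=1}^{n} Y_j K\big(\frac{X_j - x}{h_k}\big)}{\sum_{j=1}^{n} K\big(\frac{X_j - x}{h_k}\big)},$$

where we use the convention that $0/0 = 0$. Now for $k \ge 2$, introduce

$$\Delta_{k,n,L}^*(x) = \mathrm{sign}(M_{k,n}^*(x) - M_{k-1,n}^*(x)) \min(|M_{k,n}^*(x) - M_{k-1,n}^*(x)|, L 2^{-k}) \tag{30}$$

and

$$\hat m_{n,L}^*(x) = M_{1,n}^*(x) + \sum_{k=2}^{N_n} \Delta_{k,n,L}^*(x). \tag{31}$$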
Redefine the support $S$ of $\mu$ as

$$S := \{x \in \mathbb{R}^d : \mu(S_{x,1/k}) > 0 \text{ for all } k \ge 1\}. \tag{32}$$

By Cover and Hart [7], $\mu(S) = 1$.
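The truncated kernel estimate of this section has the same shape as the partitioning version, with kernel-weighted averages at bandwidths $h_k = 2^{-k-2}$ in place of cell means. A minimal sketch follows, under assumptions that are not from the paper: the cone kernel $K(u) = \max(0, 1 - \|u\|)$ is used as one example satisfying the kernel condition (with $b = r = 1/2$), and the function names are hypothetical.

```python
import math


def cone_kernel(u):
    """K(u) = max(0, 1 - ||u||): continuous, at most 1, zero outside the unit ball."""
    return max(0.0, 1.0 - math.sqrt(sum(c * c for c in u)))


def kernel_mean(xs, ys, x, k, kernel=cone_kernel):
    """Kernel-weighted mean of Y_j at bandwidth h_k = 2^{-k-2}, with 0/0 read as 0."""
    h = 2.0 ** (-k - 2)
    num = den = 0.0
    for xj, yj in zip(xs, ys):
        weight = kernel(tuple((a - b) / h for a, b in zip(xj, x)))
        num += yj * weight
        den += weight
    return num / den if den > 0.0 else 0.0


def truncated_kernel_estimate(xs, ys, x, L, N, kernel=cone_kernel):
    """Level-1 kernel mean plus increments clipped to [-L 2^{-k}, L 2^{-k}], k = 2..N."""
    prev = kernel_mean(xs, ys, x, 1, kernel)
    estimate = prev
    for k in range(2, N + 1):
        cur = kernel_mean(xs, ys, x, k, kernel)
        bound = L * 2.0 ** (-k)
        estimate += max(-bound, min(bound, cur - prev))
        prev = cur
    return estimate


# Two-point example: the raw increment from level 1 to level 2 is about -0.667,
# which the truncation clips to -L * 2^{-2} = -0.25.
value = truncated_kernel_estimate([(0.0,), (0.1,)], [0.0, 4.0], (0.0,), L=1.0, N=2)
print(f"truncated kernel estimate: {value:.4f}")
```

Each evaluation costs one pass over the data per level, so the work per query point is of order $n N_n$, the same as for the partitioning sketch.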

Theorem 2 Let $\{(X_i, Y_i)\}$ be a stationary ergodic time series with $E|Y_i| < \infty$. Assume
$N_n \to \infty$. Then almost surely, for all $x \in S$,

$$\hat m_{n,L}^*(x) \to m_L^*(x). \tag{33}$$

If the support $S$ of $\mu$ is a bounded subset of $\mathbb{R}^d$ then almost surely

$$\sup_{x \in S} |\hat m_{n,L}^*(x) - m_L^*(x)| \to 0. \tag{34}$$

If either (i) $|Y| \le D < \infty$ almost surely ($D$ need not be known) or (ii) $\mu$ is of bounded
support then

$$\int (\hat m_{n,L}^*(x) - m_L^*(x))^2 \mu(dx) \to 0. \tag{35}$$
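Proof We first prove that (16) holds with $M_{k,n}^*$ and $M_k^*$. Let

$$g_{k,n}(x) = \frac{1}{n} \sum_{i=1}^{n} Y_i K\Big(\frac{X_i - x}{h_k}\Big)$$

and

$$f_{k,n}(x) = \frac{1}{n} \sum_{i=1}^{n} K\Big(\frac{X_i - x}{h_k}\Big).$$

Similarly put

$$g_k(x) = E\Big(Y K\Big(\frac{X - x}{h_k}\Big)\Big)$$

and

$$f_k(x) = E\Big(K\Big(\frac{X - x}{h_k}\Big)\Big).$$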
We have to show that almost surely, for all $k \ge 1$, and for all $x \in S$, both $g_{k,n}(x) \to g_k(x)$
and $f_{k,n}(x) \to f_k(x)$. Consider $g_{k,n}(x)$ with $k$ fixed. Let $Q \subset \mathbb{R}^d$ denote the set of vectors
with rational coordinates. (Note that the set $Q$ has countably many elements.) By the
ergodic theorem, almost surely, for all $r \in Q$,

$$g_{k,n}(r) \to g_k(r).$$

Let $\delta > 0$ be arbitrary. Let integers $Z - 1 > M > 0$ be so large that $E(|Y| 1_{\{X \notin S_{0,M}\}}) < \delta$.
By ergodicity, almost surely,

$$\sup_{x \notin S_{0,Z}} |g_{k,n}(x)| \le \frac{1}{n} \sum_{i=1}^{n} |Y_i| 1_{\{X_i \notin S_{0,M}\}} \to E(|Y| 1_{\{X \notin S_{0,M}\}}) < \delta.$$

Since $K_{h_k}(x) = K(x/h_k)$ is continuous and $K_{h_k}(x) = 0$ if $\|x\| > h_k$, $K_{h_k}(x)$ is
uniformly continuous on $\mathbb{R}^d$. Define

$$U_k(u) = \sup_{x, z \in \mathbb{R}^d :\, \|x - z\| \le u} |K_{h_k}(x) - K_{h_k}(z)|.$$

Let $B_\delta \subset S_{0,Z} \cap Q$ be a finite subset of vectors with rational coordinates such that

$$\sup_{x \in S_{0,Z}} \min_{r \in B_\delta} U_k(\|x - r\|) < \delta.$$

For $x \in S_{0,Z}$, let $r(x)$ denote one of the closest rational vectors $r \in B_\delta$ to $x$. Now

$$\begin{aligned}
\sup_{x \in S_{0,Z}} |g_{k,n}(x) - g_{k,m}(x)|
&\le \sup_{x \in S_{0,Z}} |g_{k,n}(x) - g_{k,n}(r(x))| \\
&\quad + \sup_{x \in S_{0,Z}} |g_{k,n}(r(x)) - g_{k,m}(r(x))| \\
&\quad + \sup_{x \in S_{0,Z}} |g_{k,m}(r(x)) - g_{k,m}(x)| \\
&\le \delta \frac{1}{n} \sum_{i=1}^{n} |Y_i| + \max_{r \in B_\delta} |g_{k,n}(r) - g_{k,m}(r)| + \delta \frac{1}{m} \sum_{i=1}^{m} |Y_i|.
\end{aligned}$$

Combining the results, by the ergodic theorem, for almost all $\omega \in \Omega$, there exists $N(\omega)$
such that for all $m > N$, and $n > N$,

$$\begin{aligned}
\sup_{x \in \mathbb{R}^d} |g_{k,n}(x) - g_{k,m}(x)|
&\le \sup_{x \in S_{0,Z}} |g_{k,n}(x) - g_{k,m}(x)| + \sup_{x \notin S_{0,Z}} |g_{k,n}(x) - g_{k,m}(x)| \\
&\le 2\delta E|Y| + 3\delta.
\end{aligned}$$



Since $\delta$ was arbitrary, for almost all $\omega \in \Omega$, for every $\epsilon > 0$, there exists an integer $N_\epsilon(\omega)$
such that for all $m > N_\epsilon(\omega)$, $n > N_\epsilon(\omega)$:

$$\sup_{x \in \mathbb{R}^d} |g_{k,n}(x) - g_{k,m}(x)| < \epsilon. \tag{36}$$

As a consequence, almost surely, the sequence of functions $\{g_{k,n}\}_{n=1}^{\infty}$ converges uniformly.
Since all $g_{k,n}$ are continuous, the limit function must be also continuous. Since almost
surely, for all $r \in Q$, $g_{k,n}(r) \to g_k(r)$, and by the Lebesgue dominated convergence $g_k$
is continuous, the limit function must be $g_k$. Since there are countably many $k$, almost
surely, for all $k \ge 1$,

$$\sup_{x \in \mathbb{R}^d} |g_{k,n}(x) - g_k(x)| \to 0.$$

The same argument implies that almost surely, for all $k \ge 1$,

$$\sup_{x \in \mathbb{R}^d} |f_{k,n}(x) - f_k(x)| \to 0.$$



We have proved (16). The rest of the proof of (33) goes as in the proof of Theorem 1.
Now we prove (34). Since, by assumption, the support is bounded, and
since it is closed, it is compact. Now note that there must exist an $\epsilon > 0$
such that $\inf_{x \in S} f_k(x) > \epsilon$. (Otherwise, there would be a sequence $x_i \in S$ such that
$\liminf_{i \to \infty} f_k(x_i) = 0$. Continuity on a compact set would imply that there would be an
$x \in S$ such that $f_k(x) = 0$, in contradiction to the hypothesis that $x \in S$.) By uniform
convergence, for large $n$, $\inf_{x \in S} f_{k,n}(x) > \epsilon/2$. Thus

$$\sup_{x \in S} \left| \frac{g_{k,n}(x)}{f_{k,n}(x)} - \frac{g_k(x)}{f_k(x)} \right| \le \sup_{x \in S} \frac{|g_{k,n}(x)(f_k(x)/f_{k,n}(x)) - g_k(x)|}{f_k(x)}.$$

Thus almost surely, for all $k \ge 1$,

$$\sup_{x \in S} |M_{k,n}^*(x) - M_k^*(x)| \to 0.$$

Almost surely, for arbitrary integer $R \ge 2$,

$$\sup_{x \in S} \Big( |M_{1,n}^*(x) - M_1^*(x)| + \sum_{k=2}^{R} |\Delta_{k,n,L}^*(x) - \Delta_{k,L}^*(x)| \Big) \to 0.$$

The rest of the proof goes exactly as in Theorem 1.
□

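Corollary 2 Assume $m(x)$ is Lipschitz continuous with Lipschitz constant $C$. With the
choice of $L \ge C$, for all $x \in S$, $m_L^*(x) = m(x)$ and Theorem 2 holds with $m_L^*(x)$ substituted
by $m(x)$.

Proof Since $m(x)$ is Lipschitz with constant $C$, for $x \in S$,

$$\begin{aligned}
|M_k^*(x) - m(x)| &\le \left| \frac{\int (m(z) - m(x)) K\big(\frac{z - x}{h_k}\big)\,\mu(dz)}{\int K\big(\frac{z - x}{h_k}\big)\,\mu(dz)} \right| \\
&\le \frac{\int |m(z) - m(x)| K\big(\frac{z - x}{h_k}\big)\,\mu(dz)}{\int K\big(\frac{z - x}{h_k}\big)\,\mu(dz)} \\
&\le C h_k \\
&\le L 2^{-k-2},
\end{aligned}$$

therefore

$$|M_k^*(x) - M_{k-1}^*(x)| < L 2^{-k}.$$

The rest of the proof goes as in Corollary 1.
□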
4 Conclusions



This contribution is part of a long-standing endeavor of the authors to extend nonpara- 
metric forecasting methodology to the most lenient assumptions possible. The present 
work does push into new territory: strong consistency for finite regression under a Lips- 
chitz assumption. The computational aspects have not been explored, but the algorithms 
are so close to their traditional partitioning and kernel counterparts that it is evident that 
they could be implemented and in fact, might be competitive. 

The fundamental formula (8) leading to the truncated histogram approach was mo- 
tivated by a representation used in a related but non-constructive setting by Kieffer [17]. 
The essence is to see that an infinite-dimensional nonparametric space may sometimes be 
decomposed into sums of terms in finite dimensional spaces, with tails of the summations 
being a priori asymptotically bounded over the regression class of interest. Through 
different devices, two ideas for obtaining such tail bounds for the partition and kernel 
methods have been presented. 

Our contribution has been to apply the idea with Lipschitz continuity assuring the 
negligibility. Thus, results here are fundamentally intertwined with the Lipschitz bounds. 
Perhaps other useful expansions are possible. The interplay of finite subspaces and a 
priori bounded tails has proven a bit delicate. Sections 2 and 3 present different attacks 
to the error-bounding problem. The obvious nearest-neighbor estimator did not yield 
to this technique because the radii are random and do not necessarily decrease rapidly 
enough to assure bounded tails. The device which was successful here may find other 
applications. Evidently, a similar investigation could be carried out for regression classes 
having Fourier expansions with coefficients vanishing sufficiently quickly. 

It is well-known (e.g., [33]) that universal convergence rates under the generality 
of mere ergodicity do not exist. An avenue which would be worth exploring is that
of adapting universal algorithms, such as explored and referenced here, so that they 
asymptotically attain the fastest possible convergence if, unknown to the statistician, the 
time series happens to fall into a mixing class. The design should be such that consistency 
is still assured if mixing rates do not hold. 